[SPARK-57556][SQL] Raise a clear error for the TIME data type in Hive SerDe interop#56850
[SPARK-57556][SQL] Raise a clear error for the TIME data type in Hive SerDe interop#56850MaxGekk wants to merge 1 commit into
Conversation
|
Could you review when you have a moment? cc @cloud-fan @dongjoon-hyun |
|
@uros-b Could you review this PR, please. |
| assert(e.getMessage.contains("UNSUPPORTED_DATATYPE")) | ||
| assert(e.getMessage.contains(TimeType().sql)) |
There was a problem hiding this comment.
Nit: please use checkError instead of assert/contains.
There was a problem hiding this comment.
I looked into this but checkError does not cleanly apply here. The Hive UDF resolver in HiveSessionStateBuilder.makeFunctionExpression catches the inner UNSUPPORTED_DATATYPE and re-wraps it into a new _LEGACY_ERROR_TEMP_3084 AnalysisException, capturing the original only as the e string parameter (e.toString) and copying its stack trace -- it does not attach it as a cause (the 2-arg AnalysisException(errorClass, messageParameters) constructor sets cause = None). So getCause is null and there is no inner condition to assert against; the only structured option would be checkError on _LEGACY_ERROR_TEMP_3084 with a brittle full-message e param.
I kept assert(...contains...) and filed SPARK-57750 (sub-task of SPARK-37935) to give that legacy error a proper name and attach the cause, after which this can become a clean nested checkError.
There was a problem hiding this comment.
It would be better to merge this #56867 first of all.
| condition = "UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE", | ||
| parameters = Map( | ||
| "columnName" -> "`c`", | ||
| "columnType" -> s"\"${ArrayType(TimeType()).sql}\"", |
There was a problem hiding this comment.
Nit: how about map/struct? Shall we do those too?
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Nit: how about a more direct table-write case, e.g. INSERT INTO <metastore Hive serde table>?
There was a problem hiding this comment.
Good idea, but it turns out a Hive serde table cannot hold a TIME column in the first place, so a managed-table INSERT INTO never reaches the new write-path check. CREATE TABLE t (c TIME) STORED AS PARQUET fails at metastore creation:
AnalysisException: HiveException: IllegalArgumentException:
Error: type expected at the position 0 of 'time(6)' but 'time' is found.
at HiveExternalCatalog.createTable ... at CreateTableCommand.run
HiveClientImpl.toHiveColumn emits the catalog string time(6) as the Hive type, which Hive cannot parse. So the INSERT OVERWRITE DIRECTORY ... STORED AS cases (scalar + nested) are the ones that exercise HiveFileFormat.supportDataType directly; I left this as-is.
| time-zone. | ||
| - `TimeType(precision)`: Represents values comprising values of fields hour, minute and second with the number of decimal digits `precision` following the decimal point in the seconds field, without a time-zone. | ||
| The range of values is from `00:00:00` to `23:59:59` for min precision `0`, and to `23:59:59.999999999` for max precision `9`. The default precision is `6`. | ||
| - Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value. |
There was a problem hiding this comment.
| - Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value. | |
| - Note: Apache Hive has no TIME type, so `TimeType` is not supported in Hive SerDe interop. Storing it in a Hive SerDe table (including `INSERT OVERWRITE DIRECTORY ... STORED AS`) or passing it to a Hive UDF/UDAF/UDTF raises an error rather than silently converting the value. |
|
Thank you for the reviews, @uros-b! Addressed your comments: applied the doc note suggestion, extended the nested-type coverage to map and struct, and replied inline on the direct table-write case and the |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM. Thank you, @MaxGekk .
|
Follow-up for the |
cloud-fan
left a comment
There was a problem hiding this comment.
0 blocking, 0 non-blocking, 0 nits.
Clean, well-scoped change: replaces a cryptic MatchError with a clear, documented UNSUPPORTED_DATATYPE error, faithfully follows the SPARK-51590 / OrcFileFormat.supportDataType analogue (the case _ => true default is the correct choice for Hive SerDe's broader supported-type set), and is well tested across all three affected surfaces.
One optional, non-blocking observation: toInspector(DataType) still has no default branch, so other unmapped types (e.g. VariantType reached as a Hive UDF argument) would still surface a raw MatchError. Pre-existing and outside this PR's TIME scope -- noting only for awareness, not as a request.
… SerDe interop ### What changes were proposed in this pull request? Apache Hive has no TIME type, so `TimeType` has no faithful representation in Hive SerDe interop. This PR makes `TimeType` produce a clear `AnalysisException` instead of a `scala.MatchError` or an internal error when it reaches the `HiveInspectors` mapping functions, and rejects it in the Hive SerDe write path: - `HiveInspectors.toInspector(dataType)`, `toInspector(expr)` (TIME literal) and `toTypeInfo` now throw `UNSUPPORTED_DATATYPE` via a shared helper. - `HiveFileFormat.supportDataType` rejects `TimeType` (recursing into nested types) so Hive SerDe writes raise `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE`. ### Why are the changes needed? `HiveInspectors` had no `TimeType` case, so object-inspector creation and TypeInfo mapping fell through to a `MatchError`/internal error when a TIME column or literal reached Hive SerDe paths (e.g. a Hive UDF argument). This makes the behavior explicit and documented, consistent with the existing TIME rejection for Hive ORC (SPARK-51590). ### Does this PR introduce any user-facing change? Yes. Using TIME with Hive UDFs or Hive SerDe writes now fails with a clear error message naming the unsupported TIME type rather than a MatchError/internal error. ### How was this patch tested? Added tests in `HiveInspectorSuite`, `HiveUDFSuite` and `InsertSuite`, and documented the limitation in `docs/sql-ref-datatypes.md`.
|
Merging to master/4.x. Thank you, @dongjoon-hyun @uros-b @cloud-fan for review. |
… SerDe interop ### What changes were proposed in this pull request? Apache Hive has no TIME type, so `TimeType` has no faithful representation in Hive SerDe interop. This PR (the Option B / "clear, documented error" path from [SPARK-57556](https://issues.apache.org/jira/browse/SPARK-57556)) makes `TimeType` produce a clear `AnalysisException` instead of a `scala.MatchError`/internal error when it reaches the `HiveInspectors` mapping functions, and rejects it in the Hive SerDe write path: - `HiveInspectors.toInspector(dataType)`, `toInspector(expr)` (TIME literal) and `toTypeInfo` now throw `UNSUPPORTED_DATATYPE` via a shared `unsupportedHiveType` helper. Previously `toInspector(dataType)` had no `TimeType` case and no default branch, so a TIME column hit a raw `scala.MatchError`. - `HiveFileFormat.supportDataType` rejects `TimeType` (recursing into nested struct/array/map/UDT types, preserving the prior default for all other types) so Hive SerDe writes raise `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE` (format `Hive`) via `FileFormatWriter.verifySchema`. - Documented the limitation on the TIME entry in `docs/sql-ref-datatypes.md`. ### Why are the changes needed? `HiveInspectors` had no `TimeType` case, so object-inspector creation and TypeInfo mapping fell through to a `MatchError`/internal error when a TIME column or literal reached Hive SerDe paths (for example, a TIME argument to a Hive UDF/UDAF/UDTF). This makes the behavior explicit and documented, consistent with the existing TIME rejection for Hive ORC (SPARK-51590). ### Does this PR introduce _any_ user-facing change? Yes. Using TIME with Hive UDFs or in a Hive SerDe write now fails with a clear error that names the unsupported TIME type, instead of a `MatchError`/internal error. For example, `SELECT myHiveUDF(TIME'12:01:02')` now reports `[UNSUPPORTED_DATATYPE] Unsupported data type "TIME(6)"` (wrapped by the Hive UDF resolver), and writing a TIME column through the Hive SerDe write path reports `[UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE] The Hive datasource doesn't support the column ... of the type "TIME(6)"`. ### How was this patch tested? Added tests and ran them locally (`build/sbt 'hive/testOnly *HiveInspectorSuite *HiveUDFSuite *InsertSuite'`): - `HiveInspectorSuite`: `toInspector(TimeType())`, a TIME literal, and `TimeType().toTypeInfo` raise `UNSUPPORTED_DATATYPE`. - `HiveUDFSuite`: passing `TIME'12:01:02'` to a Hive `GenericUDFHash` fails with a message naming the unsupported TIME type. - `InsertSuite`: `INSERT OVERWRITE LOCAL DIRECTORY ... STORED AS PARQUET SELECT TIME'...'` (with `spark.sql.hive.convertMetastoreInsertDir=false`) raises `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor (Claude Opus 4.8) Closes #56850 from MaxGekk/time-hive-serde. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit e80f420) Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
Apache Hive has no TIME type, so
TimeTypehas no faithful representation in Hive SerDe interop. This PR (the Option B / "clear, documented error" path from SPARK-57556) makesTimeTypeproduce a clearAnalysisExceptioninstead of ascala.MatchError/internal error when it reaches theHiveInspectorsmapping functions, and rejects it in the Hive SerDe write path:HiveInspectors.toInspector(dataType),toInspector(expr)(TIME literal) andtoTypeInfonow throwUNSUPPORTED_DATATYPEvia a sharedunsupportedHiveTypehelper. PreviouslytoInspector(dataType)had noTimeTypecase and no default branch, so a TIME column hit a rawscala.MatchError.HiveFileFormat.supportDataTyperejectsTimeType(recursing into nested struct/array/map/UDT types, preserving the prior default for all other types) so Hive SerDe writes raiseUNSUPPORTED_DATA_TYPE_FOR_DATASOURCE(formatHive) viaFileFormatWriter.verifySchema.docs/sql-ref-datatypes.md.Why are the changes needed?
HiveInspectorshad noTimeTypecase, so object-inspector creation and TypeInfo mapping fell through to aMatchError/internal error when a TIME column or literal reached Hive SerDe paths (for example, a TIME argument to a Hive UDF/UDAF/UDTF). This makes the behavior explicit and documented, consistent with the existing TIME rejection for Hive ORC (SPARK-51590).Does this PR introduce any user-facing change?
Yes. Using TIME with Hive UDFs or in a Hive SerDe write now fails with a clear error that names the unsupported TIME type, instead of a
MatchError/internal error. For example,SELECT myHiveUDF(TIME'12:01:02')now reports[UNSUPPORTED_DATATYPE] Unsupported data type "TIME(6)"(wrapped by the Hive UDF resolver), and writing a TIME column through the Hive SerDe write path reports[UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE] The Hive datasource doesn't support the column ... of the type "TIME(6)".How was this patch tested?
Added tests and ran them locally (
build/sbt 'hive/testOnly *HiveInspectorSuite *HiveUDFSuite *InsertSuite'):HiveInspectorSuite:toInspector(TimeType()), a TIME literal, andTimeType().toTypeInforaiseUNSUPPORTED_DATATYPE.HiveUDFSuite: passingTIME'12:01:02'to a HiveGenericUDFHashfails with a message naming the unsupported TIME type.InsertSuite:INSERT OVERWRITE LOCAL DIRECTORY ... STORED AS PARQUET SELECT TIME'...'(withspark.sql.hive.convertMetastoreInsertDir=false) raisesUNSUPPORTED_DATA_TYPE_FOR_DATASOURCE.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4.8)